Shannon Code

Assigning short codewords to more likely source symbols using CDF.

Essence

$$ \begin{aligned} \ell(x) = \left\lceil \log \left( \frac{1}{p(x)} \right) \right\rceil \Rightarrow \left\lceil \log \left( \frac{1}{p(x)} \right) \right\rceil + 1 \text{(Just To Be Safe)} \end{aligned} $$

Formula

Assuming there is an order on the source alphabet $ \mathcal{X} $. The CDF of source distribution: $$ \begin{aligned} F(x) = \sum_{k < x} p(k) \end{aligned} $$ Define $ \overline{F}(x) $ as:

$$ \begin{aligned} \overline{F}(x) = \frac{F(x - 1) + F(x)}{2} = F(x-1) + \frac{p(x)}{2} \end{aligned} $$

Then $ \overline{F}(x) \in (0, 1) $ is a real number that uniquely identifies $ x $.

The codeword $ C(x) $ corresponds to the D-ary expansion of the real number $ \overline{F}(x) $, truncated according to the length function. It should satisfy: $$ \begin{aligned} |C(x) - \overline{F}(x)| &\leq \frac{p(x)}{2} \end{aligned} $$

Property

The code obeys: $$ \begin{aligned} \mathbb{E} \left[ \ell(X) \right] &< \frac{H(X)}{\log D} + 2 \end{aligned} $$ The length function is good because it implies: $$ \begin{aligned} \ell(x) &= \left\lceil \log_D \left( \frac{1}{p(x)} \right) \right\rceil + 1 \\ |C(x) - \overline{F}(x)| &\leq D^{-\ell(x)} \leq \frac{p(x)}{D} \leq \frac{p(x)}{2} &\text{[Satisified.]} \end{aligned} $$

by Jon